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1.4  Report  Organization 

This  report  is  organized  into  i  chapters.  Chapter  2  provides  an  overview  of  the  problems 
encountered  in  data  distribution  and  reviews  the  relevant  literature  in  this  area.  Chapter  3 
provides  an  overview  of  the  project's  design,  Qiapter  4  contains  an  extensive  CKample  to 
demonstrate  the  functionahty  and  use  of  the  tool.  Chapter  5  gives  concluding  remarks  and 
suggests  possible  extensions. 
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CHAPTER  2 
THE  DATA  DISTRIBUTION  PROBLEM 

2.1   Objective 

The  objective  ot  this  chapter  is  to  summaiize  the  many  problems  addressed  in  the  literature 
concerning  data  distribution.  A  high  level  understanding  of  these  problems  and  their 
interdependencies  will  familiarize  the  reader  with  the  complexity  these  problems  introduce  into  the 
distributed  database  design,  even  when  addressed  individually.  A  great  deal  of  research  has  been 
done  to  address  different  parts  of  the  data  distribution  problem.  Some  of  the  significant  work  in 
these  areas  will  be  reviewed.  To  date  however,  there  has  been  no  overall  solution  that  successfully 
addresses  all  of  the  problems  under  one  combined  methodology.  Further  research  continues  to 
look  for  optimal  solutions  to  these  issues. 

2.2  Why  Distribute  ? 

In  a  distributed  environment  there  are  clear  benefits  to  be  derived  in  distributing  data.  One 
obvious  advantage  is  the  ability  to  store  the  data  at  the  location(s)  where  it  is  most  frequenUy 
used.  This  will  achieve  a  faster  response  time  and  reduce  communication  costs  in  a  querj' 
intensive  apphcation.  A  second  advantage  in  distributing  data  is  the  potential  to  store  what  might 
be  a  very  large  database  on  smaller  machines  through  partitioning  ot  splitting  the  data 
(partitioning  will  be  reviewed  in  detail  in  a  later  .section).  As  a  final  example,  the  distiibution  of 
data  increases  the  overaU  reUabiUty  of  the  system  by  tiie  simple  fact  that  all  data  is  not  stored  on 
one  machine  (i.e.,  subject  to  single  site  failure)  [ROTH8!].  Each  of  these  points  demonsti-ate 
clear  advantages,  but  maximizing  these  benefits  requires  important  tradeoffs. 
2.3  Distribution  Schemes  and  Associated  Problems 
Consider  the  optimal  allocation  of  a  file  in  a  distributed  environment.  Intuition  would  tell  you  to 


store  the  data  where  it  is  most  frequently  used.  !f  each  site  o\vned  and  used  the  data  exclusively 
(i.e.,  no  other  site  needed  any  of  this  data)  there  wouid  be  no  design  problem.  All  data  updates 
and  queries  at  this  node  could  be  handled  locally,  tn  reality  though,  this  is  not  the  case  as 
applications  have  data  that  is  shared  among  several  users.  These  applications  may  also  have  strict 
requirements  as  to  response  time,  reliability  and  consi.stency  of  the  data  they  require.  For  this 
reason,  careful  analysis  of  the  distribution  scheme  to  be  used  is  required. 

Diso-ibutior.  schemes  generally  fall  into  two  main  categories:  partitioned  systems  and  replicated 
systems  [DRAF80). 

2.3. 1   Partitioned  Systems 

Partitioning  is  the  process  of  "assigning  a  local  object  (relation)  from  the  logical  schema  of  the 
database  to  several  physical  objects  (files)  stored  in  the  database"  [BRAY81].  A  pure  partitioned 
system  has  no  duplication  or  replication  of  data  items.  Vertical  partitioning  (or  partitioning  by 
structure)  divides  the  data  by  columns  or  attributes.  Apphcation  of  vertical  partitioning  wouid  be 
desirable  in  cases  where  only  certain  attributes  of  the  logical  record  are  needed  at  locations.  An 
example  taken  from  [DRAF801  wiU  help  illustate  this  point.  This  example  considers  a  relation 
for  orders  of  parts  as  follows: 

ORDERS  (CUST  #,CNAME,PART  NO.PART  DESCQUANTTTY  ORDERED) 

One  site  may  only  be  concerned  with  the  PART_NO  and  PART_DESC,  while  another  site  may 
maintain  the  CUST^NG  and  CUST.NAME.  In  this  instance  there  are  benefits  derived  from 
partitioning  the  data  vertically,  as  each  user  can  locally  conOxil  the  information  with  which  they 
are  directly  concerned. 

Horizontal  partitioning  divides  a  relation  by  occunence  or  tuple.  This  type  of  partitioning  is 
valuable  in  cases  where  files  can  be  distributed  based  upon  given  data  values.   Again,  using  the 


ORDERS  relation,  each  site  may  need  ail  infonnation  in  the  relation  but  may  only  deal  with  one 
type  of  part,  in  this  case  all  tuples  associated  with  a  given  part  are  assigned  to  that  particiilar 
node. 

2.3.2  Replicated  Schemes 

Replication  is  the  allocation  of  a  single  file  of  the  database  to  multiple  sites.  If  there  is  no 
replicacon  the  distributed  database  problem  is  significantly  reduced  in  the  area  of  concurrency 
control  and  synchronization,  but  the  cost  of  doing  a  transaction  may  increase  significantly  when 
the  transaction  must  access  data  from  a  number  of  sites.  The  opposite  extreme  is  full  redundancy 
where  each  file  is  present  at  every  node.  This  method  will  optimize  response  time  in  the  case  of 
queries,  but  will  severely  impact  update  performance  and  cost  as  all  updates  must  be  propagated 
to  every  site.  Partial  redundancy  is  the  balance  between  these  two  extremes.  Redundancy  is 
important  in  achieving  much  of  the  promise  of  distributed  systems.  Without  redundancy 
reliability  goals  can  only  partially  be  met  as  the  unavailability  of  even  a  single  file  may  be  .seen  as 
"total  failure"  to  some  applications.  Without  redundancy  the  choice  of  where  to  store  data  is  an 
all  or  nothing  situation.  Finally,  in  reality  updates  tend  to  be  small  and  simple  where  queries  tend 
to  be  quite  complex  and  involve  large  amounts  of  data  [R0TH81].  In  this  type  of  situation 
redundancy  would  have  clear  benefit  but  again  would  increase  the  complexity  of  the  system  from 
a  concurrency  and  synchronization  point  of  view. 

2.3.3  Summary 

There   are  several  generalizations  that   have  been  made  concerning  the   file  allocation   and 

replication  issues. 
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Champine  reviews  the  problem  in  regard  to  the  size  of  the  file  and  the  percentage  of  exception 
rate  (or  remote  request).   Figure  2.1  summarizes  his  view  [CHAM81]: 

Exception  Rate      File  Size      Distribution  .Scheme 


small  replicate 

small  large  partition 

laige  large  centralize 


Distribution  Scheme 
Figure  2.1 


His  view  simply  states  that  if  the  file  size  is  small,  rephcate  the  file  and  propagate  updates.  If  the 
exception  rate  is  small  but  the  file  size  is  large,  the  best  solution  is  to  partition  the  data  in  order  to 
place  the  fragments  where  they  are  most  frequently  used.  Finally,  if  the  file  size  is  large  and 
exhibits  a  high  exception  rate  it  may  be  necessary  to  centralize  the  file.  In  other  cases  where  the 
size  of  the  file  has  not  been  considered,  studies  have  indicated  that  the  maximum  number  of 
copies  of  a  file  should  be  one  unless  the  ratio  of  queries  to  updates  is  greater  than  or  equal  to  50% 
(ratio=.5)  [MUR085],  [CASE72].  Although  tiiese  views  are  gross  simplifications  of  the  data 
distribution  problem,  they  appeal  to  intuition  when  considering  the  trade-offs  between  update 
costs,  communication  costs  and  response  time.  In  addition,  they  serve  to  highlight  the  importance 
of  understanding  an  application's  use  of  data.  It  should  be  clear  from  the  discussion  thus  far  that 
one  cannot  reasonably  decide  where  to  put  data  unless  they  know  where  it  will  be  accessed  from 
and  how  often. 

In  summary,  the  data  distribution  problem  involves  a  clear  understanding  of  an  application  and  its 
use  of  data.  Data  allocation  however  is  only  one  aspect  in  the  design  of  a  distributed  processing 
system.  The  allocation  of  hardware  and  software  must  be  considered  in  the  overall  design,  along 
with  the  allocation  of  data.  The  number  of  design  factors  increa.ses  substantially  as  decisions  must 
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include  such  things  as  network  topology,  channel  bandwidths,  number  of  processor,  storage 
capadties,  program  locations  as  well  as  data  locations  [frEVN84].  All  of  these  decisions  are 
highly  interdependent  and  involve  trade-offs  in  the  areas  of  reliability,  performance,  development 
complexity  (synchronization  and  recovery),  growth  and  overall  system  cost  (MARI841.  Due  to 
the  complexity  of  modeling  all  of  these  factors  concurrently,  most  research  has  considered  only 
individual  design  problems  or  the  combination  of  only  a  ff,w  [HEVN84]. 

The  remainder  of  this  chapter  will  first  review  the  research  done  in  the  area  of  "pure"  file 
allocation  and  then  survey  extensions  to  these  earlier  models,  which  take  mto  account  hardware 
considerations  and  file  interdependencies. 

2.4  File  AJlocation  Solutions 

The  first  attempts  to  deal   with  the  problem  of  optimal  file  placement  used   mathematical 

programming  techniques.    Nearly  all  of  these  models  were  Unear  integer  programming  problems 

to  find  optimal  solutions.    Most  of  these  models  were  driven  by  system  requirements  as  to 

performance,  cost,  minimum  access  delays,  data  management  overhead  and  storage.   Specific  data 

models  were  assumed,  with  variants  appUed  to  different  system  resources.    Generally  however, 

these  models  worked  under  the  following  requirements  [CHAM81]: 

f 

Given:   A  description  of  user  demand  for  service  stated  as  volume  of  requests  from  each 
node  of  a  network  to  each  file 

Given:  A  description  of  resources  available,  such  as  network  topology,  link  capacity,  cost 
of  storage,  communications  cost,  etc. 

Determine:  An  assignment  of  files  to  nodes  which  minimizes  total  costs. 
2.4.1   'Pure'  File  Allocation  Models 

The  simpliest  file  allocation  model  was  introduced  by  Casey  [CASE72].  His  model  looks  for  the 
optimal  node  assignment  of  a  single  file  which  minimizes  the  total  communications  cost,  under  the 
assumption  of  a  fully  connected  network  with  no  response  rime  or  memory  restrictions  imposed. 
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One  important  contribution  of  liis  model  is  the  distinction  made  between  update  and  query 
requests.  In  the  case  of  queries  it  is  assumed  that  the  copy  of  the  file  wh:ch  minimizes  total 
communication  costs  is  chosen,  whereas  updates  arc  propagated  to  all  file  copies.  Tlie  model 
represents  the  total  communication  cost  as  a  sum  of  the  cost  over  individual  nodes  that  result  from 
a  given  file  allocation.    Casey's  cost  equation  is  paraplirased  below  [CASE72]; 

where:  ^       '"■  "' 

I  =  index  set  of  system  nodes 

j  =  index  for  nodes 

k  =  index  for  file  locations 

STR       =  fixed  cost  of  storage  for  locating  files 

at  kth  node 
QT         =  query  traffic  (emanating  from  node  j) 
UT         =  update  traffic  (emanating  from  node  j) 
QC         =  cost  of  unit  of  communication  from  node  j 

to  k  for  query 
UC         =  cost  of  imit  of  communication  from  node  j 

His  model  demonstrated  that  query  costs  decreased  as  the  number  of  copies  of  the  file  were 
increased;  however,  a  penalty  is  paid  for  storage  and  update  costs.  Rephrased,  if  storage  costs 
were  low  and  there  were  no  updates,  complete  duplication  would  be  cost  effective.  If  storage 
costs  were  hi^  and  update  activity  was  high,  ^hca  one  copy  of  the  file  would  be  optimal.  Casey 
analyzes  this  trade-off  by  examining  the  cost  function  as  the  number  of  fife"  copies  is  inaeased. 
This  is  done  using  a  directed  graph  (referenced  as  "cost  graph")  where  each  vertex  is  a  file 
assignment  and  has  an  associated  value  from  the  cost  function.  The  edges  of  the  graph  are  paths 
corresponding  to  the  addition  of  a  single  file.  Casey  demonstrates  the  monotonicity  of  this  gi^aph, 
implying  that  it  is  "sufficient  to  follow  every  path  of  the  cost  graph  until  the  cost  increases,  and  no 
more"  [CASE72].  This  bounds  the  number  of  computations  in  both  breadth  and  depth 
(commonly  referred  to  as  a  Branch  and  Bound  Search).  Therefore,  only  a  subset  of  the  original 
tree  (of  2  to  the  power  of  "n"  nodes)  needs  to  be  tested.   Even  with  this  property,  Casey's  model 
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was  proven  by  Eswaran  to  be  NP-complete  suggesting  that  heuristics  were  needed  to  efficiently 
deal  with  the  probtem  [HEVN84]. 

Another  approach  to  the  problem  was  introduced  by  Chu.  His  model  sought  to  minimize  overall 
operdring  costs  by  determining  the  optimal  placement  of  files  under  the  constraints  of  response 
time  and  storage  capacity.  No  distinction  was  made  between  query  and  update  requests 
[CHU691,  This  model  again  introduced  a  very  large  number  of  variables  for  even  small  problems, 
making  it  very  costly,  and  in  large  problems  computationally  infeasible  [CERI83]. 

Several  comprehensive  reviews  of  other  models  which  address  the  pure  file  allocation  problem  can 
be  found  in  [CERI83a],  [HEVN84],  [LEVI79].  Most  of  the  proposed  solutions  apply  different 
heuristic  measures  to  reduce  the  computational  complexity  of  the  models.  In  general  all  of  the 
models  in  this  category  assume  a  completely  connected  network  topology  and  a  completely 
defined  distribution  scheme  in  terms  of  storage  capacity,  storage  costs,  communication  costs, 
frequencies,  and  user  requests  [HEVN84].  With  all  of  these  parameters  well  defined  they  attempt 
to  obtain  an  optimal  design  for  a  very  specific  problem,  namely  the  allocation  of  a  single  file. 
One  model  which  was  introduced  by  Morgan  and  Levin  deserves  special  note,  as  it  points  out 
another  weakness  in  the  models  discussed  so  far. 

Morgan  and  Levin's  model  distinguishes  itself  from  others  by  considering  both  program  and  data 
allocation.  Their  work  points  out  the  importance  of  considering  the  dependencies  between 
programs  and  data  in  a  heterogeneous  environment.  The  dependency  points  out  tfiat  while  data 
can  move  easily  from  node  to  node,  programs  cannot,  as  in  a  heterogeneous  environment  different 
hardware  and  system  software  will  exist.  In  this  environment,  program  execution  is  limited  to 
certain  nodes.  This  is  important  as  a  transaction  at  one  node  may  invoke  a  program  at  another 
node,  which  in  turn  needs  data  from  a  third  node.  Their  model  considers  this  restriction,  by 
analyzing  the  optimal  placement  of  botii  data  and  programs.    In  addition  their  model  also 


considers  dynamic  behavior  and  uncertain  demand.  Most  of  the  previous  models  assumed  "static" 
behavior,  meaning  that  once  the  frequency  of  requests  was  determined  they  would  not  change. 
Also  assumed  was  the  complete  availability  and  accuracy  of  access  patterns.  Morgan  and  Levin's 
model  deals  with  dynamic  behavior  and  incomplete  information  by  minimizing  costs  over  a 
number  of  different  time  periods  during  which  allocation  may  change  [LEVI79].  The  practicality 
of  changing  the  disoibunon  over  time  has  been  debated  however.  As  pomted  out  by  [GRO.S80], 
once  a  distribution  scheme  is  set  up  it  remains  fairly  static  in  practice  as  redistribution  could 
involve  a  tremendous  amount  of  effort. 

Ceri.  et.  al.  and  Hevner  review  similar  work  done  by  Fisher  and  Hochbaum 
[CERI83a],[HEVN84].  This  model  is  an  optimization  algorithm  for  placing  multiple  copies  of 
programs  and  databases  over  a  network.  They  improve  the  work  done  by  Morgan  and  Levin  by 
developing  several  heuristics  to  generate  feasible  solutions  to  the  problem  and  report  practical 
experience. 

2.4.2  File  and  Hardware  Allocation  Models 

Various  extensions  have  been  made  to  the  file  allocation  models  discussed  above,  which  explore 
different  sets  of  assumptions  and  problems.  These  approaches  relaxed  some  of  the  restrictions  of 
other  models  by  addressing  such  things  as  channel  capacities  and  network  topologies. 

Mahmoud  and  Riordan  consider  the  combined  problem  of  optimal  file  allocation  and  channel 
capacity  determination  given  a  fixed  network  topology.  The  objective  of  the  model  is  to  minimize 
communication  cost  and  storage  cost  subject  to  network  delay  and  average  file  availability.  The 
model  was  a  nonlinear  integer  programming  problem  making  it  quite  expensive;  thus,  an  efficient 
heuristic  was  developed  [MAHM76].  Their  results  have  produced  reasonably  good  allocation 
solutions  [HEVN84]. 
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Irani  combines  the  file  allocation,  network  topology  and  channel  capacity  allocation  into  a  single 
problem.  The  model  minimizes  the  total  cost  of  file  storage  and  communication  capacities  over 
different  channels.  The  constraints  imposed  on  the  model  include  a  maximum  communication 
delay,  minimum  availability  of  single  files  and  a  minimum  level  of  network  reliability  [1RAN79]. 

As  a  final  example,  Hevner  reviews  the  work  done  by  Casey  [HEVN84].  Casey  extended  his 
original  model  by  including  optimal  selections  of  network  topology  and  channel  capacities.  Due 
to  the  size  of  the  problem  the  topology  considered  is  restricted  to  tree  networks.  Again,  Casey 
developed  heuristic  techniques  to  solve  the  problem  as  the  onginal  model  was  nonlmear  and 
contained  integer  and  continuous  variables. 

2.4.3   Summary 

In  summary,  earlier  models  developed  in  this  category  focused  solely  on  the  file  allocation 
problem,  leaving  all  other  factors  invariant.  Extensions  were  made  to  these  moiJels  to  deal  with 
program  allocation  and  hardware  allocation  design  choices.  Several  generalizations  can  be  applied 
to  these  models.  Most  of  these  solutions  were  integer  linear  programming  problems  (i.e.,  each 
design  parameter  had  a  0  or  1  allocation).  These  were  far  too  costly,  and  hence  iafeasible  to  run 
for  systems  of  any  real  size.  Heuristic  techiuques  were  therefore  developed  to  deal  with  the 
problems.  These  heuristics  sacrifice  optimality  for  practicality  of  use.  Even  with  the  use  of  these 
heuristics,  the  complexity  of  the  problem  has  limited  most  research  to  dealing  with  the 
optimization  of  at  most  two  resources  at  once  [HEVN84]. 

Other  major  drawbacks  of  these  models  include  their  assumptions  in  regard  to  file  usage  and 
partitioning.  These  models  do  not  address  interdependencies  between  files  which  appear  in 
realistic  databases.  They  assume  access  of  a  given  file  from  a  given  node,  and  do  not  reflect  the 
demand  for  data  access  involving  more  than  one  file.  As  pointed  out  by  [ROTH81],  consider  a 
join  where  each  of  two  nodes  has  a  file  used  in  the  new  relation.   A  join  request  involving  these 


two  relarions  could  involve  substantial  communication  costs.  A  file  allocation  scheme  that  places 
both  relations  involved  in  this  query  at  a  single  site  may  be  far  more  advantageous  than  if 
distributed  (Although  true  in  principal,  this  view  may  be  slightly  exaggerated  as  the  amount  of 
data  involved  couid  be  reduced  by  a  query  optimization  technique  that  utilizes  semi-joins 
[HEVN84]).  Finally,  these  models  assume  that  complete  files  should  be  the  unit  of  assignment  of 
data  to  nodes.  There  is  no  consideration  given  to  partitioning  the  files  in  order  to  reduce  access 
and  storage  costs.  While  these  models  are  important,  they  often  obtain  the  optimal  design  for  a 
very  specific  problem.  The  use  of  these  models  may  be  highly  advantageous  after  the  partitioning 
problem  has  been  addressed  and  the  physical  distributed  system  is  designed. 

2.5  File  Dependency  Methodologies 

As  stated  above,  one  of  the  major  problems  with  the  pure  file  allocation  models  is  the  assumption 
of  a  single  file  being  the  imit  of  distribution  with  no  consideration  given  to  file  dependencies. 
Research  in  this  area  proposes  methodologies  and  solution  methods  that  consider  the  entire 
database  schema  as  opposed  to  individual  files.  Some  of  the  research  presents  guidelines  and 
classification  schemes  to  be  followed  during  logical  distribution  design  (non-automated).  A 
notable  contribution  in  this  area  is  the  work  done  by  Baiter  [BAKE].  His  methodology  points  out 
the  major  issues  that  must  be  addressed  during  logical  design.  Other  contributions  propose 
theoretical  models  and  heuristic  algorithms  to  determine  the  logical  distribution.  Examples  from 
each  category  are  reviewed  below. 

2.5.1   Baker's  Model 

Baker  proposes  a  methodology  in  which  logical  distribution  is  defined  as  a  "partition  of  a 
collection  of  related  applications  and  their  data  into  a  maximum  number  of  groups  that  have  a 
specified  low  level  of  interdependence".  'Intemodal  dependencies'  arise  when  an  application 
tiansaction  (or  program)  requires  data  from  a  remote  node.   The  goal  then  is  to  minimize  the 
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intemoda!  dqx;ndencies  and  move  toward  'nodal  autonomy'  or  'nodal  interdqwndency'  [BAKE]. 

In  Bakers  model,  transaction  or  intergroup  dependencies  have  an  orientation,  meaning  an 
ownership  of  the  transaction.  Several  quantitative  measures  are  outlined  that  can  be  used  in 
different  combinations  to  classify  intergroup  dependencies.  These  include  frequency  of  use, 
pattern  of  usage,  required  currency,  level  of  consistency  and  timing  required  and  the  degree  of 
remote  data  needed.  In  addition  other  classifications  can  be  used  to  rate  the  dependenaes,  such 
as  read  versus  update  requests  (where  read  would  have  a  weaker  dependency)  or  dcferrability  of 
the  transaction. 

Baker's  approach  to  the  design  is  iterative  in  nature  and  follows  through  six  steps.  These  steps 
and  their  associated  activities  can  be  summarized  as  follows: 

1.  Data  Gathering:  Determine  the  number  and  types  of  databases,  database  structures, 
relations  among  the  databases  and  information  regarding  the  applicarion  make-up  (i.e.,  split 
geographically  or  by  function) . 

2.  Define  Application  Groupings:  Define  application  group  structure  such  as  order  entry, 
production  planning,  etc. 

3.  Assign  Applications  to  Groupings:  Here  a  complete  application  which  consists  of  a  set  of 
application  programs  is  assigned  to  one  group  only. 

4.  Assign  Databases  to  Application  Groupings:  Based  upon  knowledge  of  die  applications  and 
the  data  they  require,  assign  databases  to  each  application  group  (i.e.,  applications  that 
make  the  most  updates  to  data). 

5.  Assign  Transactions  to  Databases:  Here  step  4  is  ignored,  and  each  transaction  is  assigned 
to  the  group  that  contains  the  data  that  are  most  closely  related  to  the  program  or 
n-ansaction.  This  is  included  to  overcome  the  possibility  of  being  assigned  to  an  application 
group  where  the  data  the  program  most  frequendy  uses  is  in  another  group. 

6.  Analyze  Dependencies  and  Evaluate  Distribution:  The  objectives  of  diis  step  are  to  minimize 
the  communication  traffic  and  minimize  the  amount  of  data  tiiat  must  be  copied  between 
nodes.  This  step  will  be  further  described  below. 

In  analyzing  dependencies  and  evaluating  the  distribution  (step  6),  the  dependency  of  each  pair  of 
groups  is  calculated  from  die  set  of  transaction  dependencies  that  occur  between  groups.  These 
dependencies  are  characterized  on  a  single  transaction  type  by  a)  an  orientation  which  relates  the 
local  to  remote  group;  (b)  the  active  component,  or  number  which  gives  the  frequency  of  use  of 
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the  transaction  per  day  and  (c)  the  passive  component,  or  number  which  represents  the  number  of 
bytes  of  remote  data  accessed  by  the  transaction.  When  dependencies  between  groups  are  due  to 
more  than  one  transaction  type,  the  dependencies  are  combined  by  adding  the  active  components 
and  measuring  the  union  of  passive  components.  Thresholds  are  tlien  established  judgementally  to 
allow  transactions  to  be  categorized  into  four  groups;  (HH)  meaning  it  is  used  frequently  and 
accesses  a  large  quantity  of  data,  (HL),  (LH)  and  (LL). 

One  constraint  placed  in  Baker's  model  is  that  a  logical  distribution  may  not  contain  any  HH 
dependencies.  The  dependency  between  two  groups  is  then  described  using  an  orientation  and 
three  consolidated  dependencies  (HL,  LH  and  LL).  Values  of  these  thresholds  determine  the 
number  of  groups  in  the  logical  distribution,  the  extent  of  the  group's  autonomy  and  the  strength 
of  the  dependencies.  Baker  continues  to  discuss  two  types  of  dependency  support,  namely  data 
communications  when  data  currency  and  integrity  are  important  or  data  duplication  which  yields 
good  response  time,  management  control  and  system  availabihty. 

Although  Baker's  approach  may  take  several  iterations  to  reach  a  satisfactory  distribution,  it 
highlights  the  importance  of  placing  the  data  processing  functions  and  associated  data  close  to 
their  asers.   Only  in  this  way  can  a  successful  degree  of  nodal  autonomy  or  an  acceptably  low  level 
of  interdependency  be  obtained,  allowing  the  benefits  of  distributed  processing  to  be  utilized. 
2.5.2  Theoretical  Models 

In  contrast  to  Baker's  trial  and  error  approach,  some  research  proposes  theoretical  methods  to 
obtain  logical  distributions.  Similar  to  the  data  allocation  models,  these  approaches  generally 
require  heuristic  techniques  to  make  their  use  practical  [NAVA84].  Many  of  these  models  are 
concerned  with  affinity  among  attributes  and  attribute  clustering  [CERI85],[NAVA84].  The 
work  done  by  Navathe,  Ceri  ,  Wiedertiold  and  Dou  will  be  reviewed  in  some  detail,  so  that  an 
understanding  of  the  clustering  techniques  can  be  achieved. 
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Navathe  et.al.  propose  a  set  of  algorithms  to  deal  with  the  vertical  partitioning  problem  (VPP). 
The  approach  consists  of  two  phases  where  i:i  the  first  phase  the  design  is  independent  of  specific 
cost  information.  The  second  phase  performs  cost  optimization  from  knowledge  of  a  specific 
application  environment.  The  model  also  deals  with  three  environments  for  the  vcrrical 
partitioning  problem;  a  single  .site  with  one  memory  level,  a  single  site  with  multiple  memory 
levels  and  multiple  sites  [NAVA84].  The  multiple  site  allocation  is  most  relevant  to  this  review. 

The  inputs  to  the  model  are  the  logical  accesses  of  the  transactions  to  the  attributes  (i.e.,  ntmiber 
of  accesses  to  object  instances  for  one  occurrence  of  a  transaction  at  a  site)  and  the  relevant  design 
parameters  such  as  cost  of  storage,  access  and  transmission.  The  steps  in  the  first  phase  are 
summarized  below  [NAVA84): 

1 .  Construct  Attribute  Affinity  Matrix  (AA  Matrix):  The  objective  of  this  function  is  to  construct 
an  AA  matrix  which  records  the  affinity  or  imaginary  bond  between  attributes.  The  affinity 
measure  is  based  on  the  logical  access  information  which  has  been  obtained.  This  includes 
whether  or  not  a  transaction  uses  a  particular  attribute,  whether  the  transaction  is  retrieval 
or  update  and  the  number  of  accesses  to  the  object  for  one  occurrence  of  the  tiransaction. 
The  affinity  measure  recorded  is  the  sum  of  these  accesses  per  time  period  (i.e.,  per  day). 

2.  Cluster  the  Attributes:  The  objective  of  this  function  is  to  group  tiie  AA  matrix  so  that 
attributes  with  high  affinity  are  clustered  together  as  are  attributes  with  low  affinity.  This  is 
accomplished  through  a  heuristic  algorithm  tiiat  diagonalizes  the  AA  matrix  to  produce 
blocks  of  jointiy  accessed  data  items. 

3.  Partitioning:  The  authors  provide  two  mechanisms  for  partitioning.  The  first  provides 
partitioning  for  non-overiapping  fragments.  This  attempts  to  find  the  ideal  location  on  the 
matrix  to  form  the  partition  of  two  non-overiapping  fragments,  so  that  the  fragments  aie 
balanced  with  respect  to  transaction  load,  [n  general  terms,  if  'n'  is  die  number  of 
atiributes,  'n-l'  points  on  the  diagonal  are  considered.  At  each  point,  the  matrix  is  "split" 
into  an  Upper  and  Lower  fragment.  A  count  is  made  of  the  total  number  of  access  of 
transactions  that  need  only  fragments  in  the  current  upper  block  (labeled  (CU))  and  current 
lower  block  (labeled  (CL)),  and  a  count  of  the  number  of  O-ansactions  that  need  both 
fragments  (Q).  The  goal  then  is  to  select  the  point  in  the  matrix  .such  that  the  goal  function 
(z)  is  maximized:  max  z  =  CLxCU  -Cl' 

The  second  mechanism  allows  for  partitioning  with  overlapping  fragments.  This  requires 
die  use  of  two  points  on  the  diagonal,  xl  and  x2  where  attributes  between  xl  and  x2 
constitute  the  intersection.  The  goal  ftmction  is  the  same  for  the  non-overiapping  case 
however  consideration  is  given  to  read  only  versus  update  requests.  In  the  case  of  read  only 
there  are  advantages  to  sharing  the  data  among  the  two  fragments  whereas  updates  need  to 
be  directed  to  both  fragments  for  consistency. 
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The  process  as  presented  is  iterative,  as  it  is  most  likely  that  the  vertical  partitioning  will  result  in 
several  fragments.  In  order  to  reduce  the  computational  complexity  of  the  problem,  the  authors 
chose  a  suboptimal  approach  whereby  each  application  of  the  VPP  produces  two  fragments. 
These  fragments  then  become  independent  subproblems  or  fragments  on  which  the  algorithms  are 
repeated  to  further  split  the  fragments.  Tfiis  is  repeated  until  no  further  benefits  are  gained 
[NAVA84]. 

The  final  phase  of  the  methodology  deals  with  the  allocation  of  these  fragments  to  sites.  Four 
cost  factors  are  considered  in  this  stage.  These  include  the  cost  of  irrelevant  attributes  accessed 
within  a  fragment,  cost  of  accessing  fragments  for  retrieval  and  update,  storage  cost  and 
transmission  cost.  These  cost  factors  are  then  assigned  weights  according  to  their  importance  in 
the  overall  optimization  model.  In  a  distributed  environment  the  transmission  costs  receive  the 
highest  weight  in  their  model.  To  summarize,  a  table  of  partitions  versus  allocations  is 
maintained  and  for  each  possible  partition  the  algorithm  attempts  all  possible  fragment  allocations 
(m2  cases  for  m  sites).  The  'least  cost"  pair  is  selected.  In  die  case  of  replicated  sites,  a  final 
algorithm  is  invoked.  This  algorithm  looks  at  each  fragment  independently  and  allocates 
additional  copies  to  sites  until  no  further  benefit  is  gained  [NAVA84]. 

Other  optimization  models  exist  which  address  horizontal  and  replication  issues 
[CERI83a,CERI83b].  These  models  analyze  the  logical  distribution  in  terms  of  objects  and  links 
and  the  relationships  between  them.  They  assume  explicit  knowledge  by  the  user  in  terms  of 
potential  uses  of  the  database,  0-ansaction  frequencies  and  cardinalities  (number  of  instances)  of 
objects  and  links,  etc.  Given  these  inputs  they  produce  fonnal  solutions  to  the  design  problem. 
2.5.3  Summary 

The  methodologies  and  solution  methods  which  have  just  been  reviewed  address  the  distribution 
problem  from  a  higher  level  than  the  pure  file  allocation  models.   This  is  done  by  assessing  the 
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dependencies  between  transactions  or  the  affinity  or  bonding  between  attributes.  These 
approaches  highhght  the  many  issues  that  must  be  taken  into  account  during  logical  design.  It  is 
important  to  note  that  there  does  not  appear  to  be  any  one  model  that  addresses  horizontal  and 
vertical  partitioning  under  one  methodology,  due  to  the  complexity  involved. 

2.6  Related  Work 

As  a  final  note,  Hevner  has  pointed  out  the  importance  of  combined  research  in  the  areas  of  data 
allocation  and  query  optimization.  As  he  states,  the  data  allocation  problem  generally  assumes  a 
given  data  access  pattern  and  the  query  optimization  problem  assumes  a  fixed  data  allocation.  If 
these  assumptions  were  generalized,  designers  could  develop  models  that  combine  die  aUocation  of 
files  and  the  support  of  query  processing  in  the  most  efficient  manner  [HEVN84]. 

Hevner  reviews  some  of  the  recent  work  done  in  this  area.  The  work  of  Elam  minimizes  the 
amount  of  data  sent  for  a  specified  set  of  query  processing  strategies,  under  the  consttaint  that  one 
file  is  stored  separately  from  others  in  the  query  to  promote  parallel  processing.  Another 
approach  by  Apers  develops  an  algorithm  to  minimize  data  ti-ansmission  costs  by  clustering 
relations.  The  constraint  placed  on  this  model  is  that  only  one  copy  of  each  relation  is  stored  in 
the  system  (i.e.,  no  redundancy)  [HEVN84].  In  agreement  with  Hevner,  further  research  in  this 
area  is  needed. 

2.7  Conclusions 

This  chapter  enumerated  the  many  issues  associated  with  the  dislribution  of  data  and  file 
allocation  in  a  distributed  environment.  Several  attempts  to  deal  with  these  problems  have  been 
addressed  in  the  literature.  Eariy  models  focused  solely  on  the  optimal  placement  of  files,  with 
extensions  added  later  to  address  program  and  hardware  allocation  in  paraUel.  The  major 
weakness  of  these  models  was  the  assumption  of  a  file  being  the  unit  of  allocation  with  no 
incorporation  of  file  partitioning  or  file  interdependence.   Other  methodologies  and  solutions  were 
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presented  that  address  these  issues.  These  methodologies  focused  on  transaction  dependencies  and 
bonding  between  attributes  when  analyzing  the  placement  of  data. 

In  summary,  in  a  distributed  environment  a  database  designer  will  be  faced  with  many  different 
problems,  including  how  to  partirion  files,  where  to  place  these  partitions  and  what  degree  of 
redundancy  should  be  incorporated.  Although  an  optimal  solution  to  aU  of  these  problems  is  not 
realistic  at  this  time,  various  models  exist  for  different  classes  of  these  decisions.  Integration  of  a 
solution  to  some  of  these  problems  into  an  interactive  tool,  would  greatly  reduce  the  complexity 
involved.  Such  a  tool  would  provide  a  building  block  on  which  the  designer  could  utilize  his 
judgement,  to  design  a  feasible  distribution  scheme. 
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CHAPTER  J 
DESIGN 

3.1  Objectives 

The  objective  of  this  project  is  to  design  a  tool  that  can  be  used  to  aid  the  database  designer  in  the 
distribution  of  data.  TotaJ  optimajity  and  automation  of  all  aspects  of  the  distribution  problem 
still  requires  a  great  deal  of  research.  A  tool  therefore  can  only  expect  to  provide  guidance  and 
recommend  solutions,  as  ultimately  the  design  process  still  relies  to  a  large  extent  on  the  designer. 
In  that  regard,  the  distribution  algorithms  developed  should  be  integrated  into  an  interactive 
design  tool.  With  such  a  tool,  the  designer  can  review  the  steps  that  have  been  taken  and  modify 
the  results  as  desired. 

3.1.1   Design  Overview 

The  tool  being  designed  is  an  extension  to  two  automated  tools  in  existence  today.  The  first  is  the 
"document_handler"  program  which  processes  application  documents  and  determines  the 
functional  dependencies  (FD's)  they  represent.  The  FD's  are  used  as  input  to  the  "bernl" 
program,  which  produces  a  database  in  third  normal  form  (3NF).  The  data  distribution  tool 
(hereafter  referred  to  as  "dist.data")  then  deals  with  the  question,  what  if  1  now  want  to  distribute 
the  data  ?  A  high  level  overview  of  the  functionality  of  these  tools  is  required  in  order  to 
underetand  the  interdependencies  between  them. 

3.1.1.1  Document.liandler  Program 

In  dstiibuted  processing,  one  of  the  fundamental  keywords  is  document.  A  collection  of  the 
documents  used  in  an  organization  will  tell  you  a  great  deal  about  the  database  schema  required. 
A  complete  set  of  user's  documents  should  supply  all  of  the  data  items  used  by  an  organization. 
Problems  arise  though,  as  different  user  documents  refer  to  the  same  data  item  by  different  names 
(synonyms)   or  the  same   name   is   applied  to  different  data  items  on   different  documents 
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(homonyms).  Another  problem  arises  when  analyzing  these  documents,  as  some  data  items 
names  appear  on  a  user  document  but  are  not  actually  stored  in  the  database.  Tliese  are  derived 
data  items  that  may  be  computed  from  other  data  values.  Tlie  database  designer  must  examine 
these  documents  to  generate  a  nonambiguous  list  of  data  items  for  the  database  [WOEL71J.  This 
can  be  an  arduous  task. 

The  document  handler  tool  was  developed  to  provide  automated  assistance  in  interpreting  these 
documents.  The  tool  takes  as  input  all  user  documents  and  their  associated  columas,  along  with  a 
specification  of  their  use  as  input,  resident  or  output  documents.  The  documents  and  columns  are 
scanned  one  by  one,  and  the  following  activities  take  place  [COHE79]: 

—  Deletion  of  synonym/homonym  names 

—  Removal  of  insignificant  columns  (some  columns  may  not  be  included  in  the  database  due 
to  their  nature.   A  signature  field  for  example) 

—  Solve  undeclared  output 

These  activities  are  accomplished  by  interactive  dialogue  with  the  user.  The  system  contains  a 
sophisticated  mapping  between  all  documents  and  columns.  This  allows  a  cross  reference  listing 
of  any  column  to  any  document  as  well  as  a  listing  of  any  document,  it's  type  and  all  columns  it 
contains.  This  information  is  stored  and  maintained  by  tiie  program. 

Once  a  unified  list  of  document  names  and  attributes  has  been  derived,  keys  are  specified  for  each 
document.  These  keys  are  then  used  in  conjunction  with  the  columns  in  a  document  to  form 
functional  dependencies.  These  functional  dependendes  are  then  used  as  input  into  die  bem2 
program. 

3.1.1.2  Bem2  Program 

The  bem2  program  is  a  tool  which  automates  the  steps  of  Bernstein's  algorithm,  that  produces  a 
database  schema  which  is  in  third  normal  form  (3NF)  with  a  minimal  number  of  relations.  A 
database  in  3NF  contains  no  extraneous  attributes,  no  partial  dependencies  and  no  transitive 
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dependendes  (a  tnmsitive  dependency  exists  when  you  can  get  from  a  key  to  a  non-key  to  a  non- 
prime).  The  removal  of  extraneous  attributes  and  redundant  FD's  is  particularly  applicable  to  the 
FD's  introduced  by  the  document.handler  program,  as  many  documents  will  use  the  same  keys 
and  contain  a  great  deal  of  data  items  in  common. 

The  steps  of  Bernstein's  algorithm  can  be  summarized  as  follows  [COHE79|: 

1.  Eliminate  fiom  the  functional  dependencies  those  data  items  that  can  be  derived  from  other 
functional  dependencies  (extraneous  attributes) 

2.  Eliminate  from   the  set   those  functional   dependencies   that   can   be   derived   from   the 
remaining  set  of  FD's  (redundant  FD's) 

3.  Group  the  remaining  functional  dependencies  into  sets  with  equivalent  left  hand  sides 

4.  Merge  the  groups  that  have  equivalent  left  hand  sides  (i.e. ,  keys) 

5.  Remove  transitive  dependencies  from  the  data  items 

6.  Construct  relations  based  on  the  groups  of  functional  dependencies. 
3.1.1.3  Data  Distribution  Tool 

The  dist_data  tool  is  the  focus  of  this  project.    It  extends  the  work  of  the  tools  just  reviewed,  by 

addressing  the  fact  that  the  resultant  database  may  be  used  in  a  distributed  environment.    It 

utilizes   much   of  the   information   obtained   in   the   document_handler   and   bem2   programs. 

Speafically,  the  document_haiidler  provides  critical  information  about  an  application  which  is 

needed    to    make    intelhgent    distribution    decisions.     This    information    includes    application 

transactions  in  the  form  of  input  ajid  output  documents. 

The  bem2  program  removes  all  extraneous  attributes  and  transitive  dependencies  to  produce  a 
database  in  3NF.  These  relations  coupled  with  the  information  on  customer  documents  provide 
the  basis  for  defining  meaningful  partitions  and  file  placement  onto  nodes  in  a  network. 

As  previously  discussed,  there  are  two  main  concerns  when  deaUng  with  data  distribution.  The 
first  is  to  evaluate  the  utility  of  partitioning  data  objects  into  fragments,  and  the  second  is,  once 
partitioned  how  these  fragments  are  allocated  to  nodes  on  a  network.  This  tool  addresses  both 
these  issues.    The  intent  is  not  to  promise  tine  optimality,  but  to  provide  a  feasible  solution  to 


•21  - 


these  complex  issues. 
3.1.1.4   Limitations 

Some  of  the  requirements  originally  outlined  for  an  automated  docum.ent  handle;  [WOELSI], 
were  not  implemented  in  prototype  document_handler  program.  Specifically,  an  indication  of  the 
frequency  of  use  on  a  per  document  basis  and  the  ownership  of  data  items  (i.e.,  what  document(s) 
own  each  particular  data  item).  These  pieces  of  information  are  required  in  the  distribution 
algorithms  presented.  As  these  modifications  would  not  prove  difficult  to  make,  they  wi'J  be 
assumed  as  input  for  the  purpose  of  this  project. 

3.2  Detailed  Design 

This  section  presents  the  detailed  design  of  the  dist_data  tool.  The  design  algorithm  involves  four 
basic  steps  that  can  be  summarized  as  follows; 

1 .  Identify  application  structure  and  network  topology 

2.  Create  Partition&Tiles 

3.  Determine  File  Placement 

<1.     Analyze  Profitabihty  of  Replication  (Branch  and  Bound) 
5.     Sensitivity  Analysis 

The  steps  of  the  algorithm  are  iterative  in  nature;  step  2  is  repeated  for  all  relations  produced 
from  bem2,  and  steps  3,  4  and  5  are  repeated  for  each  file  or  partition  generated.  Figure  3.1 
provides  a  flowchart  of  this  activity.  The  remainder  of  diis  chapter  will  discuss  each  step  of  tiie 
design  algorithm  in  detail. 
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Flowchart  of  Dist_data  Process 
Figure  3.1 
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3.2.1   Step  1  :  Identify  Application  Structure  and  Network  Topology 

Data  distribution  cannot  begin  without  undenitanding  the  organizational  structure  and  distribution 
scheme  for  any  particular  application,  'llie  firet  step  in  the  design  process  is  to  gather  this 
information  from  the  user.    For  ease  of  use  the  tool  will  prompt  the  user  for  the  needed 

information,  which  includes; 

1 .  Number  of  nodes  on  the  network 

2.  Identification  of  organizational  groupings  (i.e.,  INVENTORY,  BII.LING,  etc.) 

3.  On  a  per  organization  basis: 

—  Wliich  node  the  organization  resides  on 

—  A    list    of    nodes    which    interconnect    with    this    organization    and    the    associated 
communication  costs  of  each  interconnec-non 

—  A  list  of  the  documents  used  by  the  organization 

With  this  information  in  hand,  several  required  pieces  of  information  can  be  compiled.  First,  the 
application  network  topology  has  been  defined.  The  system  may  now  build  cost  tables  which 
reflect  the  transmission  costs  incurred  for  queries  and  updates.  It  is  important  to  point  out  one 
requirement  of  the  system  which  is  needed  by  later  algorithms.  The  file  placement  algorithm  (step 
3)  and  replication  algorithm  (step  4)  require  cost  figures  to  be  associated  between  any  two  pairs  of 
nodes  on  the  network  (i.e.,  a  fully  connected  network).  The  problem  can  be  demonstrated  as 
follows.  Consider  a  system  with  three  nodes,  where  nodes  1  and  3  are  connected  to  node  2,  but 
no  connection  exists  between  nodes  I  and  3.  This  topology  is  illustrated  in  figure  3.2. 

Node(l) 
Node  (2) 


\ 


Node  (3) 

Example  Network  Topology 
Figure  3.2 
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The  cost  tables  which  need  to  be  buiit  require  that  a  cost  be  associated  with  nodes  1  and  3.  In  the 
example  above,  this  can  be  accomplished  by  taking  the  sum  of  the  communication  costs  between 
nodes  I  and  2,  and  nodes  2  and  3.  In  a  situation  where  multiple  paths  are  available,  the  least  cost 
path  should  be  associated  with  the  "imaginary"  link.  ITie  system  should  detect  these  missing  links 
and  compute  the  cost  factor.  Once  a  cost  is  associated  with  all  of  the  nodes  in  the  network,  the 
cost  table  can  be  buiit.  Figure  3.3  shows  an  example  of  a  cost  table  for  three  nodes.  Reading  the 
table  across  tells  you  that  the  cost  of  communication  between  node  I  and  itself  is  zero,  node  1  and 
2  is  8  and  between  nodes  1  and  3  is  12.  Tliis  information  is  required  in  steps  3,  4  and  5  of  the 
algorithm. 


1 

2 

3 

1 

0 

8 

12 

2 

8 

0 

20 

3 

12 

20 

0 

Network  Cost  Table 
Figure  3.3 

The  other  piece  of  infonnation  obtained  from  the  user  deals  with  the  application  structure.  The 
system  now  knows  on  which  nodes  each  document  or  transaction  resides.  This  node  identifier 
must  be  stored  in  conjunction  with  the  document  entries  as  it  is  vital  in  determining  a  meaningful 
partitioning  of  relations  (tihis  will  become  clear  after  reviewing  the  algorithm  for  step  2). 

The  information  obtained  in  this  step  serves  as  a  framework  for  all  other  steps.  It  should  be  noted 
that  this  is  the  only  step  which  requires  information  from  the  user  (all  other  steps  of  the  algorithm 
make  use  of  the  information  provided  by  the  document_handler  and  beni2  programs). 
Appropriate  error  detection  and  recovery  will  be  provided  when  parsing  the  user  input.  This 
includes  syntactic  checks  as  well  as  checks  for  invalid  document  names  or  im^alid  interconnect 
nodes. 
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3.2.2  Step  2:  Create  Partitions/Files 

Step  2  of  the  algorithm  is  a  key  step  in  the  distributed  design  process.  In  this  step,  the  relations 
produced  by  the  bem2  algorithm  arc  analyzed  to  determine  if  partitioning  should  be  applied. 
Recall  that  initialiy  the  application  has  a  list  of  documents  wliich  they  own  and  execute.  The 
document_handler  program  removes  ail  homonyms,  syTionyms  and  insignificant  columns  to  insure 
that  the  data  in  the  document  is  interpreted  correctly.  The  bem2  program  then  takes  these 
documents  (represented  as  FD's)  and  produces  a  3NF  database.  'Ilieso  relations  do  not  reflect 
customer  usage  however,  as  one  relation  may  now  contain  pieces  of  data  from  several  different 
documents.  Chapter  two  reviewed  the  importance  of  analyzing  the  applicadon's  use  of  data  , 
particularly  in  regard  to  the  dependency  which  exists  between  files.  Tliese  dependencies  may 
reveal  clusters  of  data  which  belong  together  or  that  could  form  meaningful  partitions.  Once  these 
partitions  are  identified,  the  file  allocation  algorithms  can  deal  with  the  physical  placement  of 
these  fragments  onto  sites  in  a  network. 

Several  different  approaches  which  have  been  taken  in  dealing  with  partitioning  were  discussed  in 
Chapter  two.  The  heuristic  chosen  for  this  design  process,  is  to  analyze  the  frequency  of  use  of 
each  attribute  in  the  relation  on  a  per  node  basis,  and  cluster  these  attributes  together  into  a  file. 
This  approach,  although  simplified  is  very  similar  to  the  approach  used  by  Baker.  Ihe 
justification  for  this  algorithm  is  as  follows: 

—  The  grouping  of  attributes  by  frequency  of  use  agrees  with  the  underlying  message  in  the 
literature;  store  the  data  where  it  is  most  frequently  used. 

—  Partitioning  the  data  in  this  manner  usually  results  in  partitions  being  stored  at  the  same 
node  as  the  owner  of  those  attributes.  This  proves  beneficial,  as  most  of  the  time  people 
are  "greedy"  in  the  sense  that  they  want  to  maintain  control  over  the  data  they  own. 

—  Availabihty  of  information:  The  document_handler  program  contains  all  the  needed 
information  to  analyze  the  relations  in  this  manner.  Each  document  entry  contains  all 
related  data  items  and  the  frequency  of  use.  Step  1  of  the  distribution  algorithm  also 
associated  a  node  with  each  document.  All  of  this  information  is  readily  available  and 
feasible  for  the  scope  of  this  project. 
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In  order  to  accomplish  the  partitioning,  three  steps  are  needed.  These  steps  build  a  document 
table  associated  with  the  relation  and  sort  it  by  node,  analyze  the  data  frequencies  and  then  form 
partitions  based  on  this  frequency.  At  the  conclusion  of  this  step  the  designer  may  review  the 
partitioning  which  has  taken  place,  and  make  modifications  if  desired.  These  functions  work  on 
one  relation  from  the  bem2  algorithm  at  a  time.  They  are  therefore  repeated  until  all  relations 
have  been  processed. 

The  phases  of  this  step  are  reviewed  in  detail  below. 

3.2.2.1   Build  Document  Table 

Figure  3.4  and  3.5  contain  one  relation  produced  by  beni2  and  three  document  entries  as  they 
would  look  after  step  i  of  the  distribution  process. 


RELl  (KEY_ATTR,  ATTRl,  ATTR3,  ATTR5) 

Bem2  Relation 
Figtire  3.4 


NODE  FORM  KEY_ATTR  ATTRl  ATTR2      ATTR3      ATTR4 

1  forml  10  10*  10  10  10* 

NODE  FORM  KEY.ATTR  ATTRS  ATTR8 

3  form2  10  10  10 

NODE  FORM  KEY.ATTR  ATTRS  A1TR8 

3  forni3  5  5  5 

Document  Entries 
Figure  3.5 

The  first  step  is  to  build  a  document  table  which  corresponds  to  the  relation  being  examined.   In 
this  example,  only  those  documents  using  the  key  attribute  and  attributes  1,3  and  5  are  of 


■27- 


concern.  This  table  is  then  sorted  by  node  for  ease  of  processing  in  the  next  step.  Figure  3.6 
shows  the  document  table  at  the  conclusion  of  this  step.  Note  that  attributes  2,4  and  8  used  by 
the  forms,  do  not  appear  in  this  table. 


NODE  DOC  KEY_ATTR  ATTRl       A'lTR?      ATTR5       ATTR? 

1  forml                10  10*  10 

3  form2               10  10 

3  form3                5  5 


Document  Table 
Figure  3.6 

3.2.2.2  Analyze  Frequency  of  Use 

In  this  step,  each  data  item  or  attribute  is  examined  individually  and  an  aggregate  usage  at  each 

node  is  determined.   Two  important  assumptions  are  included  in  this  step: 

1.  Keys  are  not  analyzed,  as  the  key  to  this  relarion  is  required  in  any  partition  created  (i.e., 
the  key  must  be  duplicated  for  access) 

2.  An  access  to  an  attribute  by  the  owner  is  taken  as  an  update  transaction  (recall  that  an 
owner(s)  of  a  data  item  must  be  indicated.  This  is  represented  by  a  '*'  in  these  examples. 
It  is  possible  for  more  than  one  owner  to  exist  for  any  given  attribute.)  These  accesses 
represent  exception  rates  such  as  new  customers  being  entered,  etc 

Using  the  example  in  Figure  3.6,  the  following  data  would  be  compiled: 


-  ATTRl 

-  ATTR3 

-  ATTRS 


10  updates  node  1 

10  queries  node  1,  3  queries  node  3 

10  queries  node  3 


3.2.2.3  Create  Partitions 

This  step  processes  the  frequencies  above  to  determjne  the  partitioning.  In  most  cases  the 
aggregate  total  of  query  and  update  requests  from  a  node  is  used  to  determine  the  owner.  In  the 
case  where  a  query  count  from  one  node  equals  the  update  count  from  another  node,  the  update 
access  is  given  the  higher  priority  (It  is  recognized  that  further  heuristics  could  be  applied  when 
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queries  are  vei-y  close  to  updates,  etc.  Further  heuristics  are  not  addressed  in  this  design).  Each 
attribute  is  therefore  associated  with  the  node  which  contains  the  highest  frequency  or  use.  The 
final  step  combines  all  attributes  from  the  same  node  into  one  file. 

The  final  parritions  resulting  from  the  example  above  are: 

FTLEl  =  (KEY^ATTR,  ATTRl,  ATTR3) 
FILE2  =  (KEY.ATTR,  ATTR5) 

3.2.2.4  Query  User 

After  the  paititions  have  been  formed,  the  results  should  be  displayed  to  the  user.    At  this  point 

the  user  will  be  allowed  to  modify  the  partitioning  if  desired.   Three  ojwrations  mil  be  allowed: 

1.  MOVEATTR to  FILE 

2.  CREATE  FILE  (This  will  create  a  new  partition,  which  may  then  be  populated 

through  a  series  of  MOVE  requests) 

3.  MERGE  FILE and  FILE 

These  operations  will  again  be  prompted  for,  so  that  minimal  effort  is  required  on  the  user's 
behalf. 

3.2.3  Step  3:  Determine  File  Placement 

Once  the  relation  has  been  partitioned  into  files,  each  of  these  files  needs  to  be  analyzed 
individually  to  determine  where  on  the  network  they  should  be  placed.  The  objective  ftmction 
chosen  is  to  place  the  file  at  the  location  which  minimizes  the  oveiall  communication  costs  in 
regard  to  update  and  query  requests.  To  accomplish  this,  ti-ansacrion  tables  must  be  built  to 
indicate  the  frequency  of  requests  issued  against  this  file.  Query  and  update  tables  are  then  built 
and  analyzed  to  determine  the  optimal  placement  of  the  file. 

3.2.3.1   Build  Transaction  Table 

A  transaction  table  needs  to  be  built  for  each  partition  or  file  produced  in  step  2.  The  table 
contains  a  count  of  the  number  of  queries  and  updates  issued  against  this  file  from  each  node. 
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This  informatio!!  was  previously  gathered  for  all  attributes.  This  step  however  is  only  concerned 
with  those  attributes  associated  with  the  partition  being  analyzed.  The  same  rules  for 
distinguishing  between  query  and  update  requests  in  step  2  ,  also  apply  to  this  step.  In  addition, 
one  other  assumption  is  made: 

—  It  is  assumed  that  all  attributes  in  a  document/file  are  accessed  together  (i.e.,  cannot  access 
individual  attributes  out  of  a  file,  the  entire  file  is  retrieved).  This  implies  that  if  a  form  has 
some  queries  and  some  updates  against  the  relation,  an  update  transaction  is  assumed.  It 
would  not  seem  appropriate  to  consider  these  as  two  separate  accesses,  so  one  access  is 
assumed  with  updates  weighing  more  heavily. 

The  output  of  this  process  will  be  a  transaction  table  which  holds  the  associated  access 
frequencies. 

Figure  3.7  shows  an  example  of  a  transaction  table  (an  additional  node  has  been  added  from 
previous  examples). 


Nodes      Query      Update 


1  10  10 

2  30  10 

3  15  0 


Transaction  Table 
Figure  3.7 

3.2.3.2  Build  Cost  Table 

Cost  tables  must  now  be  created  to  reflect  the  query  and  update  costs  that  would  result  from 
placing  the  file  on  any  given  node.  These  costs  are  computed  using  the  transaction  table  and  the 
network  cost  obtained  in  step  1. 
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For  example,  assume  the  following  entries  exist  in  the  transaction  table  and  network  cost  table  for 
Nodel: 

Transaction  Table  Network  Cost 

Node  Query  Update 
1              10  10 

.2  30  10 

3  15  0 

The  query  costs  associated  with  placing  the  file  at  node  1  can  be  computed  as  follows: 

1 .  Node  I  to  itself:  cost  =  0 

2.  Node  2  to  node  1:  cost  =  240  (30  query  requests  from  node  2  at  a  cost  factor  of  8) 

3.  Node  3  to  node  I:  cost  =  180  (15  query  requests  from  node  3  at  a  cost  factor  of  12) 

4.  Total  query  costs  incurred  =  420 

The  same  algorithm  is  used  to  determine  update  costs.  The  total  cost  associated  with  each  node  is 
the  sum  of  query  and  update  costs.  These  costs  are  computed  for  every  node  on  the  network. 
The  node  representing  the  minimum  cost  is  chosen  for  file  placement. 

3.2.4  Step  4:  Determine  Profitability  of  Replication 

The  next  step  of  the  distribution  is  to  determine  the  profitability  of  repUcation.  This  is 
accomplished  by  building  a  decision  tree  and  performing  a  branch  and  bound  search. 

In  theory,  the  branch  and  bound  search  looks  for  an  optimal  solution  by  defining  initial  upper  and 
lower  values  of  the  objective  function  (in  this  case  the  objective  function  is  the  minimization  of 
communication  costs).  From  the  feasible  solutions,  the  best  solution  is  made  the  upper  value  (U) 
of  the  problem.  All  other  solutions  are  matched  against  this  solution  in  an  attempt  to  find  a 
better  solution.  Any  solution  which  produces  a  value  higher  than  (U)  are  deleted,  as  further 
branching  would  not  lead  to  a  better  solution.  This  process  continues  through  a  series  of 
iterations  in  an  attempt  to  find  the  optimal  solution  [GREEN78]. 
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The  branch  and  bound  search  technique  was  reviewed  briefly  in  Qiapter  two.  This  discussion 
highlighted  the  fact  that  even  though  Uiis  search  procedirre  bounds  the  number  of  required 
computations,  heuristics  are  still  needed  to  reduce  the  potential  computational  requirements.  In 
this  light,  ttie  hueristic  identified  for  this  design  is  a  simple  "greedy"  heuristic.  At  any  given  point 
in  the  search,  only  the  best  solution  is  kept.  Although  it  is  reahzed  that  this  can  not  guarantee  the 
true  optimal  placement  of  the  file,  it  provides  a  reasonable  placement  within  the  scope  cf  this 
project. 

To  correlate  the  above  discussion  to  the  file  allocation  problem,  consider  the  placement  of  a  file 
on  a  network  consisting  of  three  nodes.  Three  levels  of  the  decision  tree  are  depicted  in  figure  3.8 
for  purposes  of  this  discussion. 


0  0 


\ 


(level  1)      1.0,0         0   10         0,0   1 
(Level  2)     110  10   1  I  I 


(Level  3)      I    1    1 


Partial  Decision  Tree 
Figure  3.8 

At  the  first  level,  each  vertex  represents  a  file  assigimient  to  a  given  node  (denoted  by  I's  in  those 
positions  corresponding  to  file  nodes,  O's  elsewhere).  The  file  placement  algorithm  discussed  in 
step  4,  determines  the  optimal  placement  of  one  copy  of  the  file  to  the  network.  For  this 
example,  assume  this  is  node  1  (represented  by  a  100  at  level  1  on  the  graph).   With  the  "greedy" 
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hueristic  applied,  the  search  will  now  only  be  concerned  with  replication  schemes  that  involve 
node  1.  The  search  proceeds  to  level  2.  where  the  replication  choices  are  a  copy  of  the  file  at 
nodes  1  and  2,  or  a  copy  at  nodes  1  and  3.  The  cost  associated  with  each  of  these  choices  is 
analyzed.  Obtaining  the  cost  figure  is  very  similaj-  to  the  file  placement  algorithm;  however,  it  is 
modified  slightly  to  account  for  the  replication.  For  example,  when  analyzing  the  costs  associated 
with  file  copies  allocated  to  nodes  I  and  2  the  following  changes  will  be  made.  The  cost 
associated  for  queries  at  both  node  1  and  node  2  are  zero,  as  each  node  will  now  have  a  local 
copy.  A  query  from  node  3  can  now  be  satisfied  from  either  node  I  or  2,  .so  the  lower  cost  is 
assumed.  Update  costs  aie  increased  however,  as  updates  must  be  propagated  to  all  file  copies. 
In  this  case,  the  update  cost  associated  with  the  rephcation  would  be  the  sum  of  update  costs 
associated  with  nodes  1  and  2.  Again,  the  total  cost  is  the  sum  of  query  and  update  costs.  If  this 
cost  is  less  than  the  cost  assodated  with  only  1  copy  of  the  file,  then  this  choice  is  now  considered 
the  "best"  placement  and  the  search  continues  to  level  3  (which  represents  total  rephcation  in  this 
example).  If  the  costs  associated  with  replication  is  higher,  the  search  has  ended  (after  both  nodes 
are  tested). 

3.2.5  Step  5:  Sensitivity  Analysis 

As  the  rephcation  decision  may  be  far  from  optimal,  a  sensitivity  analysis  is  being  provided.  The 
intent  of  this  step  is  to  allow  the  designer  to  modify  the  communication  costs  associated  with  a 
given  topology  and  re-analyze  the  file  placement.  In  other  words,  the  designer  will  be  able  to 
request  a  reiteration  of  the  branch  and  bound  search  with  the  communication  cost  increased  by 
some  specified  amount.  If  the  result  after  this  run  differs  significantly  from  the  first,  tiiis  may 
indicate  that  replication  is  desirable  in  either  case.  For  example,  if  a  5%  increase  in 
communication  costs  shows  that  replication  is  desirable  this  would  indicate  that  the  file  placement 
is  very  sensitive.  This  type  of  sensitivity  may  imply  that  replication  is  in  order. 
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At  the  conclusion  of  this  step,  the  file  allocation  process  has  been  completed  for  one  file  or 
partition.  Each  original  relation  may  be  split  into  several  partitions.  Steps  3,  A  and  5  (if  desired) 
will  need  to  be  repeated  for  each  partition. 

3.3   Summary 

This  chapter  presented  the  design  for  the  dist^data  tool.  Iliis  tool  takes  the  information  regarding 
customer  documents  coupled  the  3NF  relations  they  represent,  and  automates  a  distribution 
analysis.  This  analysis  includes  recommended  partitioning  for  the  relations  based  on  file 
dependencies  and  frequency  of  use,  and  the  allocation  of  these  partitions  onto  the  nodes  in  a 
network.  The  allocation  scheme  is  based  on  minimizing  communications  costs  for  update  and 
query  transactions.  The  process  is  iterative,  examining  one  relation  at  a  rime  and  then  each 
resulting  partition  in  that  relation.   This  process  is  repeated  until  all  relations  have  been  analyzed. 

The  interface  to  the  tool  is  interactive  in  natiire,  allowing  the  designer/user  to  make  modifications 
if  desired  to  appropriately  steer  the  distiibunon  design.  A  minimum  amount  of  error  detection 
and  recovery  has  been  provided. 

Chapter  four  contains  a  detailed  example  of  the  dist_data  operation. 
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CHAPTER  4 
EXAMPLE  DESIGN 

4. 1   Example  Design/Results 

This  chapter  presents  a  complete  design,  following  the  steps  outlineJ  in  Chapter  three.  The 
system  used  in  this  example,  models  a  manufacturing  firm  which  consists  of  three  organizations: 
OJST.REL,  ORDER/BILLING  and  INVE^f^ORY.  These  organizations  will  be  distributed 
across  a  three  node  network. 

The  example  starts  by  showing  a  logical  listing  of  the  documents,  as  they  would  appear  at  the  end 
of  part  one  of  the  document  .handler  program.  Appendix  I  contains  a  list  of  these  documents. 
Each  document  is  labeled  with  a  type  (i.e.,  INPUT,  RESIDENT  or  OUTPUT)  and  has  a 
document  key  specified.   An  owner  is  also  indicated  for  each  data  element  (denoted  by  a  '"). 

Appendix  II  contains  the  FD's  as  produced  by  the  documentjiandler.  These  FD's  are  used  as 
input  into  the  beni2  program.  At  the  conclusion  of  the  bem2  run,  two  3NF  relations  are 
produced. 

The  dist_data  tool  commences  after  the  results  from  the  above  programs  have  been  obtained. 
Appendix  III  contains  a  step-by-step  example  of  the  procedures  used.  The  remainder  of  this 
chapter  will  summarize  the  processing  that  occurs  at  each  step. 

llie  first  step  of  the  dist.data  tool,  is  to  identify  the  application  structure  and  network  topology 
((in.l).  This  notation  will  be  used  throughout  the  chapter  for  Appendix  ni,  point  1). 
Information  is  solicited  from  the  user  regarding  the  organizational  structure,  the  documents  used 
and  the  network  topology.  Note  that  in  the  example,  the  system  must  detect  the  missing  node 
interconnection  between  nodes  2  and  3,  and  search  for  the  least  cost  connection  (in  this  case  theiB 
is  only  1  path  between  nodes  2  and  3).   At  die  conclusion  of  this  step,  a  network  cost  table  has 
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been  built  and  each  node  has  a  list  of  associated  documents  that  are  resident  at  the  node. 

Tlie  next  operation  examines  the  utility  of  partitioning  (III. 2).  This  step  of  the  algorithm  deals 
with  one  relation  at  a  time,  which  in  this  case  is  REI.l  produced  by  bem2.  The  table  is 
constructed  by  identifying  all  documents  which  use  the  attributes  associated  with  this  particular 
relation.  The  owner  of  a  data  item  is  indicated  by  a  '*',  and  the  numbers  in  the  table  represent 
the  frequency  of  use.  These  frequencies  are  then  analyzed  on  a  per  data  item  basis  to  determine 
the  owner  (i.e.,  node)  of  that  data  item  (denoted  by  an  'X').  All  data  items  with  a  common 
owner  are  then  merged  to  form  a  partition.  In  this  example,  Node  1  exhibited  the  highest 
frequency  of  use  for  all  data  items  and  therefore  no  partitioning  was  done.  At  this  pomt  the 
designer  may  display  the  partitioning  that  has  been  created  and  perfonm  modifications  if  desired 
(see  section  3.2.2.4). 

Step  3  of  the  algorithm  determines  the  placement  of  all  partitions/files  produced.  It  deals  with 
one  partition  at  a  time,  so  it  is  repeated  for  each  partition  produced  in  step  2  (in  the  example  so 
far,  only  1  partition  has  been  created).  Step  3  begins  by  building  a  transaction  table  (ni.3).  This 
table  represents  the  aggregate  usage  of  this  file  from  all  nodes.  Recall  from  section  3.2.3.1,  there 
are  two  critical  assumptions  used  when  building  this  table.  The  first  is  that  access  by  an  owner  is 
interpreted  as  an  update  transaction.  Secondly,  a  document  or  file  is  assumed  to  be  accessed  in 
entirety.  The  importance  of  these  assumptions  can  be  highlighted  by  examining  the  HIST_FILE 
document.  This  document  accesses  c_no  and  part_no  as  a  query  request  and  tot_price  as  an 
update.  This  transaction  from  node  1  is  interpreted  as  an  update  request  of  frequency  30.  The 
remainder  of  the  requests  from  node  1  are  query  only,  so  the  table  is  populated  with  30  query 
requests  and  30  updates. 

Once  the  transaction  table  has  been  built,  the  cost  for  query  and  update  resulting  from  placing  the 
file  on  any  given  node,  can  be  determined.    The  network  cost  table  (from  step  1)  is  used  in 
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combination  with  the  transaction  table,  to  compute  these  costs.  For  example,  Node  1  issues  30 
queries  and  30  updates.  If  this  file  were  placed  on  Node  1,  there  would  be  no  cost  for  queries  or 
updates  fi'om  Node  1.  as  all  the  information  is  local.  However,  there  are  quei-y  and  update 
requests  to  this  file  from  Nodes  2  and  3.  Node  2  issues  6  queries  and  5  updates  with  a  cost  factor 
of  8.  Reading  the  query,  update  and  network  tables  down  the  Node  I  column,  this  means  if  the 
file  were  placed  at  Node  1,  a  cost  of  48  would  be  incurred  in  queries  and  40  in  updates  from 
Node  2.  The  computation  continues  in  this  manner  in  order  to  analyze  the  total  cost  from  any 
node.  The  node  which  represents  the  minimum  cost  is  recommended  for  the  file  placement  (Node 
1,  in  this  example). 

After  a  single  file  placement  is  recommended,  the  tool  looks  at  the  profitability  of  repUcation 
(III.4).  This  represents  step  4  of  the  distribution  algorithm.  As  previously  discussed  in  section 
3.2.4,  the  tool  progresses  only  with  the  'Twst  choice"  at  any  given  time.  From  the  preceding  step, 
node  I  is  chosen  for  the  file  placement.  Rephcation  options  at  level  two  of  the  cost  graph  include 
two  copies  of  the  file,  at  either  nodes  1  and  2,  or  node  1  and  3.  The  costs  are  recalculated  under 
the  assumption  of  repUcation,  to  determine  if  either  option  results  in  a  lower  cost.  In  the 
example,  replication  proves  profitable  at  nodes  1  and  3.  Therefore,  the  final  recommendation  for 
file  1  is  to  replicate  the  file  at  nodes  1  and  3,  at  a  total  cost  of  572. 

The  final  step  of  the  algorithm  (step  5)  aUows  the  designer  to  perform  sensitivity  analysis  (111.5). 
This  has  not  been  illustrated  in  the  example,  but  would  involve  receiving  a  cost  increase  parameter 
from  the  user  (i.e.,  5%),  and  recomputing  the  cost  tables  and  reanalyzing  the  distribution. 

Appendix  m.6  displays  the  network  after  the  placement  of  file  1.  The  communication  traffic 
resulting  from  this  placement  is  also  iUustrated.  In  analyzing  this  placement,  there  are  two  points 
of  interest.  A  general  view  taken  by  [ROTHSl]  discussed  in  Chapter  two,  stated  that  in  reality 
queries  tend  to  be  large  and  complex  whereas  updates  tend  to  be  small  and  simple.    A  file 
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placement  that  minimizes  query  traffic  is  therefore  beneficial.  The  file  placement  produced  in  this 
algorithm  has  modeled  this  assumption.  The  major  query  requests  have  been  fulfilled  at  both 
nodes  1  and  3.  Updates  do  incur  communication  expense  but  these  updates  are  small  (part_no  and 
tot_price).  The  frequency  of  access  to  this  file  at  Node  2  is  so  low,  that  further  replication  would 
not  prove  advantageous.  The  .second  theory  discussed  in  the  literature,  is  that  a  single  copy  of  a 
file  suffices  under  most  conditions,  if  the  ratio  of  query  to  updates  is  more  than  or  equal  to  50% 
[CASE72].  Looking  back  at  the  n-ansaction  table  for  file  1,  the  ratio  of  query  to  update  (77 
queries,  37  updates)  is  48%.  The  results  of  this  algorithm,  have  stayed  within  these  guideUnes  (it 
should  be  noted,  that  this  comparison  holds  throughout  the  remainder  of  this  example.  All  file 
placements  meet  these  guidelines). 

Processing  now  returns  to  step  2  of  the  algorithm,  to  begin  analysis  of  the  next  bem2  relation 
(tlt.7)   At  the  conclusion  of  this  step,  REL2  has  been  partitioned  into  two  files. 

Step  3  of  the  algorithm:  determine  file  placement,  is  once  again  invoked  for  File  2  (recall  that  this 
step  analyzes  a  single  partition  at  a  time).  Ihe  processing  steps  are  identical  to  that  of  file  1 ;  the 
node  that  minimizes  the  total  costs  is  found  (in.8)  and  then  replication  is  analyzed  (in.9).  The 
final  recommended  placement  for  file  2,  is  again  two  copies  of  the  file  residing  at  nodes  1  and  3. 
The  total  cost  for  this  placement  is  264.  111.11  shows  the  overall  view  of  the  network  with  both 
files  1  and  2  allocated. 

To  complete  the  processing,  one  partition  is  left  to  analyze  (File  3).  The  algorithm  returns  to  step 
3,  to  determine  the  single  file  placement  for  this  partitioning  (HI.  12).  Again,  the  processing  is 
identical  to  that  of  file  1  and  2  and  is  iUustrated  in  01.12  -  in.l4.  In  the  case  of  file  3,  replication 
is  not  profitable  and  therefore  a  single  copy  of  the  file  is  placed  at  node  3.  The  cost  associated 
with  this  placement  is  0. 
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Appendix  III.  15  shows  the  final  results  after  the  dist.data  program  has  processed  all  relations  and 
partitions.  To  summaiize,  file  1  and  2  are  replicated  on  nodes  1  and  3,  and  a  single  copy  of  file  3 
exists  at  node  3,  The  overall  cost  resulting  from  these  allocations  is  836. 
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CHAPTER  5 
CONCLUSIONS 

5.1   Discussion  of  Results 

The  goal  of  the  system  presented  in  this  paper  is  to  assist  the  database  designer  in  finding 
solutions  to  the  problems  of  file  partitioning  and  file  placement  in  a  distributed  system.  The  basic 
premise  of  the  system  is  that  user  documents  contain  the  needed  information  to  make  reasonable 
distribution  decisions.  A  complete  set  of  user  documents  supply  the  data  items  used  by  an 
organization  and  model  the  transactions  (in  the  form  of  input  and  output  documents).  This 
information,  coupled  with  an  understanding  of  the  ownership  of  these  documents,  allows  file 
partitioning  and  file  placement  to  be  made  based  on  the  frequency  of  use  and  ownership  of  the 
data  items.  By  automating  the  analysis  of  this  data  and  proposing  solutions,  a  great  deal  of 
complexity  is  removed  from  the  designer.  This  is  not  to  suggest  that  this  is  the  optimal  solution 
nor  the  only  solution  to  the  problem,  as  there  are  many  issues  which  can  impact  these  design 
choices.  As  such,  the  system  was  designed  to  be  interactive  in  nature,  allowing  the  designer  to 
make  modifications  as  required  to  appropriately  steer  the  design  process. 

Research  continues  to  address  the  key  problems  of  file  partitioning,  file  placement  and 
redundancy  considerations,  in  an  effort  to  find  optimal  solutions.  The  underlying  message  in  all 
these  solutions  is  clear  however;  a  major  determinant  in  the  increased  use  of  distributed  processing 
will  be  an  increased  ability  to  get  data  where  it  is  needed,  and  used  most  frequently. 

5.2  Extensions 

Numerous  enhancements  could  be  made  to  increase  the  benefits  derived  from  this  system.  These 
include  extensions  to  the  partitioning  and  file  allocation  heuristics,  as  well  as  the  overall  user 
interface  to  the  system. 
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The  partitioning  scheme  developed  for  this  system  does  not  directly  address  partitioning  by 
occurrence,  or  horizontal  partirioning.  Although  the  system  will  handle  the  same  organizations 
existing  at  multiple  nodes  and  maintaining  ownership  of  the  same  documents,  it  will  assign  the  file 
to  one  of  the  two  nodes  only.  For  example,  there  is  nothing  to  preclude  the  user  from  entering 
the  ORDER/BILLING  organization  on  two  different  nodes,  each  owning  the  same  documents.  If 
the  frequency  of  use  were  the  same  against  each  document,  the  system  will  arbitrarily  designate 
one  node  as  owner.  The  single  file  allocation  of  this  file  would  result  in  the  node  that  represents 
the  minimum  communication  cost.  In  this  case  the  algorithm  will  most  often  conclude  that 
replication  is  profitable  at  the  second  node,  imless  the  communication  costs  vary  significantiy. 
The  algorithm  will  not  however,  correctiy  address  the  other  remote  requests  to  this  file,  as  it 
assumes  both  files  are  identical.  The  second  problem  arises  when  the  frequency  of  access  varies 
between  the  two  locations  (i.e.,  more  activity  with  certain  part  numbers  than  others).  In  this 
case,  the  algorithm  will  cater  to  the  node  which  exhibits  the  higher  access  rate.  These  problems 
point  out  the  importance  of  allowing  input  from  the  designer,  as  the  designer  can  easily  remedy 
this  situation  by  creating  and  allocating  new  partitions.  A  more  sophisticated  method  of  detecting 
and  deahng  with  horizontal  partitioning  would  be  desirable  however. 

Anothei  specific  enhancement  that  could  be  applied  to  the  partitioning  algorithm,  is  to  supply  an 
improved  heuristic  for  differentiating  between  updates  and  queries.  The  current  design  gives 
weight  to  an  update  in  the  case  where  query  and  update  rates  are  identical,  but  does  nothing 
more.  It  would  be  desirable  to  further  analyze  this  difference  and  extend  the  heuristic  (i.e.,  if 
there  were  only  10%  more  queries  than  updates,  should  the  algorithm  still  favor  updates  ?). 

Other  hmitations  of  the  system  exist  in  the  file  allocation  and  replication  heuristics.  The  model  is 
very  restrictive,  in  that  it  only  considers  query  and  update  communication  costs.  No  consideration 
is  given  to  such  things  as  storage  restrictions  and  costs,  communication  channel  load,  etc.  Chapter 
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two  highlighted  several  different  models  which  take  various  system  resources  into  consideration. 
The  integration  of  any  one  (or  more)  of  these  issues  into  the  model  would  certainly  provide  a 
more  refined  file  allocation. 

Finally,  limited  attention  was  given  to  the  design  of  the  user  interface  to  the  tool.  A  primitive 
method  of  displaying  and  modifying  the  information  was  described,  that  although  is  sufficient, 
could  be  greatly  enhanced.  One  could  envision  a  menu  driven  system  with  graphical  display 
capabiUties,  in  which  the  designer  could  control  the  total  execution  of  the  system,  requesting  any 
of  the  steps  to  be  run  in  any  order.  The  graphics  would  allow  a  cleaner  display  of  the  processing 
at  any  given  step,  or  a  visual  view  of  the  network  topology  and  file  placement. 
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APPENDIX  I 
EXAMPLE  DESIGN:  DOCUMENT.HANDLER  OUTPUT 

Each  DOajMENT  is  labeled  witli  a  type  (INPUT,  OUTPUT,  RESrDENT),frequency  of  use 
(i.e.,  per  day),  and  has  a  document  key 

Each  DATA_ELEMENT  has  an  owner(s)  associated  with  it  (indicated  by  '*'). 


NEW_CUST 

Doaoment  Attributes: 

Location. output 

Frequency:  5 

*cust_no  (doc.key) 

•c_name 

•c_str 

*c_cty 

*c_.sta 

*c  zip 

•c_ph 

CUST.FILE 

Document  Attributes: 

Location,  resident 

Frequency;  5 

c_no  (doc.key) 

c_name 

c_str 

c_cty 

c_sta 

c_zip 

c_ph 
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CUST.LIST 

DocumenE  Attributes: 

Location. output 

Frequency:  1 

c_no  (doc. key) 

c_nanie 

c_str 

c_cty 

c_sta 

c_zip 

c_ph 

NEW_PART 

Document  Attributes; 

Location. input 

Frequency:  2 

*part_no  (doc. key) 

*part_desc 

*part_price 

PART_SUP 

Document  Attributes: 

Location. input 

Frequency:  10 

part_no  (doc.  key) 

part_desc 

•qty_made 

*made_date 

part_price 

CALC.INV 

Document  Attributes: 

Location,  resident 

Frequency:  30 

part_no  (dix.key) 

qty_made 

made.date 

*niade_to_date 

qty_ord 

cord_date 

•ord_to_date 

*qty_reni 
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INV_REPT 


part_no  (doc.  key) 

part_desc 

made_.to..date 

ord_to_date 

qty_rem 


Document  Attiibutes: 
Local!  on. ourput 
Frequency;  1 


CUST.ORD 


c_no  (doc.  key) 

c_name 

c_str 

c_cty 

c_sta 

c_zip 

c_ph 

part_no 
*qty_ord 
*cord_date 


Document  Attiibutes: 
Location. input 
Frequency:  20 


HIST_FILE 


c_no  (doc. key) 
part_no  (doc.  key) 
cord_date  (doc.  key) 
qty_ord 
pait_price 
*tot_price 


Document  Attributes: 
Location  resident 
Frequency:  30 
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INVOICF  Document  Attributes; 

I-X>cati  on.  output 
Frequency:  10 

c_no  (doc.  key) 

c_name 

c_str 

c_cty 

c_sta 

c_zip 

c_ph 

paTt_no 

part_price 

qty_ord 

tot_price 

cord_date 
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APPENDIX  n 
EXAMPLE  DESIGN:  FUNCTIONAL  DEPEM5ENCIES/BERN2  OLTPUT 

**  FUNCTIONAL  DEPENDENCIES  CREATED  FROM  DOCUTVIENT.HANDLER  •• 

cust_no  -->  c_name,  c_str,  c_cty,  c_sta,  c_zip.  c_ph 

part_no  ~>  part._desc 

part_n<)  -->  part.desc,  qty^made,  made_,date,  part_price 

part.no  -->  qty_made,  made_date,  made_to_date,  qty_ord,  ord_to_date,  qty_rem 

part_no  -->  part_desc,  made_co_date,  ord_to_date,  qty_rem 

c_no  -->  c^name,  c_sn,  c_cty,  c_sta,  c.zip,  c_.ph,  part_no,  qty_ord,  cord_date 

c_iio,  part_no,  cord_date  -->  qty_ord,  part_price,  tot_price 

c.no  ->    c_name,   c_str,   c_cty,   c_sta,   c_zip,   c_ph,   part_no,   part_price,   qty_ord  tot_price, 
cord_date 


*•  3NF  RELATIONS  PRODUCED  BY  BERN2  •• 

RELl  :  c_no  ->  c_namc,  c_str,  c_cty,  c._sta.  c_2ip,  c_ph,  part_no,  tot_price 

REL2   ;   part_no  ->   qty_made,    made_date,   qty.ord,   cord_date,    part.desc,    made.to_date, 
ord_to_date,  qty.rem,  pan_price 
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APPENDIX  m 
EXAMPLE  DESIGN:  DIST_DATA 

1)  Identify  Application  Structure  and  Networlt  Topology 

(Algorithm  Step  I) 

PART  A: 

Enter  No.  Nodes  on  Network:   3 
Enter  No.  of  Organizations:  3 

Enter  Org.  Name:                                ORDER/BILLING      INVENTORY 
ORDER/BILLING  on  Node?                              1                                3 
Enter  Interconnect  Nodes  &  Cost:                   2,8                            1,12 

3,12                          

CUST_REL 

i- 

1.8 

Enter  Documents  Owned:                         CUST_ORD            N"EW_PART 

HIST.FILE              PART_SU? 

INVOICE               CA1.C_INV 

INV.REPT 

NEW_CUST 
CUST.FILE 
CUST  LIST 

Part  B:  Logical  View  of  Networlc: 

/       Node  2       A 
y\(OJST_REL)  J 

8        y^ 

C      Node  1            \ 
i^ORDER/BILLINGy 

12     \^^ 

N/"      Node  3        A 
^(EMVENTORVy 
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Part  C:  Build  Cost  Tables 


(Nodes) 


1 

7 

■>. 

1 

0 

8 

12 

2 

8 

0 

20 

3 

12 

20 

0 
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2)  Create  Partitions/'Files:  Relation  1 

(Algorithm  Step  2) 
Part  A:  Build  Document  Table  (Sort  by  Node) 


c 

c 

c 

c 

c 

c 

c 

part 

tot 

Node:  Doc 

no 

name 

str 

cty 

St 

zip 

ph 

no 

price 

1  CUST_ORD 

20 

20 

20 

20 

20 

20 

20 

20 

1  HIST.FILE 

30 

30 

30* 

1  INVOICE 

10 

10 

10 

10 

10 

10 

10 

10 

10 

2  NEW_CUST 

5' 

5* 

5* 

5- 

5* 

5* 

5' 

2  CUST_FTLE 

5 

5 

5 

5 

5 

5 

5 

2  CUSr.LIST 

I 

1 

1 

I 

1 

1 

1 

3  NEW.PART 

2* 

3  PART_SUP 

10 

3  CALC_INV 

30 

3INV_REPT 

1 

Part  B:  Analyze  Frequency  of  Use,  Determine  "Owners ' 


c_name 

c_str 

c_cty 

c_st 

c_zip 

c_ph:  30  queries,  0  updates  -  Node  1  (X) 

6  queries,  5  updates  -  Node  2 
part_no:        60  queries,  0  updates  -  Node  1  (X) 

41  queries,  2  updates  -  Node  3 
tot_price:       10  queries,  20  udpates  -  Node  1  (X) 

Part  C:  Merge  Attributes  by  Owner  to  form  Partitions: 

File  1  =  (c_no,  c_nanie,  c_str,  c_cty,  c_st,  c_zip,  _ph,  part_no,  tot_price) 
(No  partitioning  done  for  this  file) 

Part  D:  Modify  Partitions  (if  desired) 
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3)  Determine  File  Placement  (Algoritlim  Step  3):  Relation  1 ,  File  Name  =  1 


PART  A:  Build  Transaction  Table 


Queries 

Updates 

1 

30 

30 

2 

6 

5 

3 

41 

2 

PARI'  B:   Build  Cost  Tables 


(Netivork  Cost  Table  from  Previous  Step) 
(Nodes) 


I 

? 

"1 

1 

0 

8 

12 

2 

8 

0 

20 

3 

12 

20 

0 

Queries 

1117 

3 

1 

0 

240 

360 

2 

48 

0 

120 

3 

492 

.820- 

0 

540 

1060 

480 

Updates 


1 

? 

3 

1 

0 

240 

360 

2 

40 

0 

100 

3 

24 

40 
280" 

0 
"460 

Total  Costs:  Node  1:  540  +  64  =  604 

Node  2:  1060+280  =  1340 
Node  3;  480  +  460  =  940 


***  Recommended  File  Placement  at  Node  1  *• 
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4)  Determine  PrnfiUbility  of  Replication  (Algorithm  Step  4):  File  1 


^000^ 


a)  Node  !  and  2; 


Query  Cost:  Update  Cost: 

Node  1,2  =  0         Node  1  =  240 
Node    3  =  492       Node  2  =  40 
Node  3  =  64 


Total      =  492        Total    =  344 

Total  Cost  =  492  +  344  =  836  (No  replication) 

b)  Node  1  and  3: 

Query  Cost:  Update  Q>st: 

Node  1,3  =  0       Node  1  =  360 
Node    2  =  48      Node  2  =  140 
Node  3  =  24 


Total      =  48        Total   =  524 

Total  Cost  =  48  +  524  =  572   "  Replicate  •' 

c)  Continue  to  Branch,  Nodes  1,2,3: 

Query  Cost:  Update  Cost: 

Nodes  1,2,3  =  0  Node  1  =  600 
Node  2  =  140 
Node  3  =  64 


Total        =  0  Total   =  804 

Total  Cost  =  0  +  804  =  804  (No  RepUcation) 

•"  Recommended  File  Placement  for  File  1;  NODE  l.NODE  3  ••• 
TOTAL  COST  =  572 
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5)  Perform  Sensitivity  Analysts  (if  desired) 

(Algorithm  Step  5) 


6)  •*  Final  Placement  for  File  I  *« 


File  1 
(30  querii 
30  updates) 


f        Node  2         j 


Network  Trafflc: 

Nodes  1-2:  6  queries 
5  updates 
(file  1) 
Nodes  1-3:  32  updates 
(file  i) 


File  1  (41  queries, 
2  updates) 
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7)  Create  Partitions/Files:  Relation  2 

(Algorithm  Step  2) 

Part  A:  Build  Document  Table  (Sort  by  Node) 


made 

ord 

part 

qty 

made 

qty 

cord 

part 

to 

to 

qty 

part 

Node:Doc 

no. 

made 

date 

ord 

date 

desc 

date 

date 

rem 

price 

1  CUST.ORD 

20 

20* 

20* 

I  HIST.FFLE 

30 

30 

30 

30 

1  rNVOICE 

10 

10 

10 

10 

3  NEW_PART 

2' 

2* 

2* 

3  PART.SUP 

10 

10* 

10* 

10 

10 

3  CALC_INV 

30 

30 

30 

30 

30 

30* 

30' 

30* 

3  nsrv_REFr 

1 

1 

1 

1 

Part  B:  Analyze  Frequency  of  Use,  Determine   Owners" 


qty_made 
made.date: 

qty_ord 
cord_date; 


part_desc: 

made_to_date 

ord_to_date 

qty_rem: 

part_price: 


30  queries,  10  updates  -  Node  3  (X) 


40  queries,  20  updates  -  Node  1  (X) 
30  queries,  0  updates  -  Node  3 

1 1  queries,  2  updates  -  Node  3  (X) 


1  query,  30  updates  -  Node  3  (X) 

40  queries,  0  updates  -  Node  1  (X) 
10  queries,  2  udpates  -  Node  3 


Part  C:  Merge  Attributes  by  Owner  to  form  Partitions: 

File  2  =  (part_no,  qty_ord,  cord_date,  part_price) 
File  3  =  (part_no,  qty_made,  made_date,  part_desc, 
made_to_ciate,  ord_to_date,  qty_rem) 

Part  D:  Modify  Partitions  (if  desired) 
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8)  Determine  File  Placement:  Relation  2,  File  Name=  2 
(Algorithm  Step  3) 

Part  A:  Build  Transaction  Table:  File  2  (algorithm  step  4) 
File  2  =  (part_no,  qty_ord,  cord_date,  part_price) 


Queries 

Updates 

I 

40 

20 

2 

0 

0 

3 

40 

2 

PartB:   Build  Cost  Tables 


(Network  Cost  Table  from  Previous  Step) 
(Nodes) 


1 

2 

3 

1 

0 

8 

12 

2 

8 

0 

20 

3 

12 

20 

0 

Queries 


Updates 


1 

2 

3 

1 

0 

320 

480 

2 

0 

0 

0 

3 

480 

800 

0 

480 

1120 

480 

1 

2 

3 

1 

0 

160 

240 

2 

0 

0 

0 

i 

24 

40 

0 

24 

200 

240 

Total  Costs:  Node  1:  480  +  24  =  504 

Node  2:  1120+200  =  1320 
Node  3:  480  +  240  =  720 


*•*  Recommended  File  Placement  at  Node  I  ** 
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9)  Determine  Profitability  of  Replication:  File  2 
(algorithm  step  4) 

100                  010             001 
110                  iOl         '                  ' 

a)  Node  1  and  2: 

Ouerv  Cost: 

Update  Cost: 

Node  1,2  =  0 

Node  1  =  160 

Node    3  =  480 

Node  2  =  200 

Node  3  =  64 

Total      =  480 

Total   =  224 

Total  Cost  =  480  +  224  =  704  (No  replication) 

b)  Node  1  and  3: 

Ouerv  Cost; 

Update  Cost: 

Node  1,3  =  0 

Node  1  =  240 

Node    2  =  0 

Node  2  =  0 

Node  3  =  24 

Total      =  0 

Total   =  264 

Total  Cost  =  0  +  264  =  264   "  Replicate  " 
c)  Continue  to  Branch,  Nodes  1,2,3: 

Oueiv  Cost: 

Update  Cost: 

Nodes  1,2,3  =  0 

Node  1  =  400 

Node  2  =  0 

Node  3  =  64 

Total        =  0 

Total   =  464 

Total  Cost  =  0  +  464  =  464  (No  RepUcation) 

•  Recommended  File  Placement  for  File  2:  NODE  l.NODE  3  • 
TOTAL  COST  =  264 

-58 

10)  Perform  Sensitivit}'  Analysis  (if  desired) 

(Algorithm  Step  5) 


11)  ••  Final  Placement  For  File  1  and  File  2  ** 


File  1 

(30  qiieriesX 

30  updates) 

File  2 

(40  queries, 

20  udpates) 


(       Node  2  ) 


Network  Traflic: 

Nodes  1-2:  6  queries 
5  updates 
(file  1) 
Nodes  1-3:  32  updates 
(file  1) 
30  updates 
(file  2) 


f         Node  3        j 

File  1  (41  queries, 
2  updates) 


File  2  (40  queries, 
2  updates) 
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12)  Determine  Hie  Placement:  Relaiion  2,  File  Name=  3 
(Algorithm  Step  3) 

Part  A:  Build  Transaction  Table:  File  3  (algoritlim  step  4) 

File  3  =  {part_no,  qty_made,  made_date,  pait_desc, 

made_to_date,  ord_to_date,  qty_rem) 


Queries 

Updates 

1 

0 

0 

2 

L     0 

0 

T^     1 

42 

Part  B:   Build  Cost  Tables 


(Network  Cost  Table  from  Previous  Step) 
(Nodes) 


1 

? 

3 

1 

0 

8 

12 

2 

8 

0 

20 

3 

12 

20 

0 

Queries 


Updates 


1 

? 

3 

1 

0 

0 

0 

2 

0 

0 

0 

3 

.12.. 

-2Q, 

.0. 

12 

20 

0 

1 

? 

3 

1 

0 

0 

0 

2 

0 

0 

0 

3 

504 

840 

0 

504 

840 

0 

Total  Costs:  Node  1:  12  +  504  =  516 
Node  2:  20  +  840  =  860 
Node  3:  0  +  0  =  0 


**•  Recommended  File  Placement  at  Node  3  •• 


I'. 
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13)  Determine  Profitability  of  Replication:  File  3 

(algorithm  step  4) 


*'  Cost  is  zero:  replication  not  profitable  ** 


'  Recommended  Hie  Placement  for  File  3:  NODE  3  ' 
TOTAL  COST  =  0 


14)  Perform  sensitivity  analysis  (if  desired) 

(Algorithm  Step  5) 
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15)  ••  FINAL  OUTPUT  OF  DIST.DATA  TOOL  •« 


File  I :  at  Nodes  1  and  3  Cost  =  572 
File  2:  at  Nodes  1  and  3  Cost  =  264 
File  3:  at  Nodes  2  Cost  i=  0 


Total  Ci)st 


836 


Node  2 


Filel 

(30  queries' 
30  updates) 

File  2 

(40  qtjeries, 

20  udpates) 


Network  Traffic: 

Nodes  1-2:  6  queries 
5  updates 
(file  1) 
Nodes  1-3:  32  updates 
(file  1) 
30  updates 
(file  2) 


File  1  (41  queries, 
2  updates) 

File  2  (30  queries, 
10  updates) 

File  3(1  query, 
42  updates) 
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APPENDIX  W 
CODE  LISTING:  DIST_DATA.C 


41    10 
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Distributed  information  systems  are  systems  which  locally  or  geogieiphically  distribute  elements 
within  a  computing  system.  In  this  environment  the  database  designer  is  faced  with  many  new 
design  problems,  one  of  the  most  critical  being  the  distribution  of  data  which  most  accurately 
reflects  the  processing  needs  of  the  organization. 

The  main  focus  of  this  report,  is  in  the  design  of  an  interactive  system  that  automates  the 
partitioning  of  files  and  their  placement  onto  a  distributed  network.  A  review  of  the  relevant 
literature  is  first  presented,  followed  by  an  overview  of  the  system's  design.  Finally,  an  extensive 
example  is  provided  to  demonstrate  the  fiuictionaUt>  and  use  of  the  interactive  system. 
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